{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "afd55886-5f5b-4794-838e-ef8179fb0394",
   "metadata": {},
   "source": [
    "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
    "```\n",
    "make venv \n",
    "source venv/bin/activate \n",
    "pip install jupyterlab\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "## This is here as a reference only\n",
    "# Users and application developers must use the right tag for the latest from pypi\n",
    "%pip install \"data-prep-toolkit-transforms[pii_redactor]==1.0.0a5\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
   "metadata": {},
   "source": [
    "##### **** Configure the transform parameters. \n",
    "```\n",
    " --pii_redactor_entities PII_ENTITIES\n",
    "                        list of PII entities to be captured for example: [\"PERSON\", \"EMAIL\"]\n",
    " --pii_redactor_operator REDACTOR_OPERATOR\n",
    "                        Two redaction techniques are supported - replace(default), redact \n",
    "  --pii_redactor_transformed_contents PII_TRANSFORMED_CONTENT_COLUMN_NAME\n",
    "                        Mention the column name in which transformed contents will be added. This is required argument. \n",
    "  --pii_redactor_score_threshold SCORE_THRESHOLD\n",
    "                        The score_threshold is a parameter that sets the minimum confidence score required for an entity to be considered a match. Provide a value above 0.6\n",
    "```\n",
    "#####"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebf1f782-0e61-485c-8670-81066beb734c",
   "metadata": {},
   "source": [
    "##### ***** Import required classes and modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dpk_pii_redactor import PIIRedactor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7234563c-2924-4150-8a31-4aec98c1bf33",
   "metadata": {},
   "source": [
    "##### ***** Setup runtime parameters and invoke the transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "95737436",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "17:17:13 INFO - pipeline id pipeline_id\n",
      "17:17:13 INFO - code location None\n",
      "17:17:13 INFO - data factory data_ is using local data access: input_folder - ray/test-data/input output_folder - output\n",
      "17:17:13 INFO - data factory data_ max_files -1, n_sample -1\n",
      "17:17:13 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
      "17:17:13 INFO - orchestrator pii_redactor started at 2025-01-16 17:17:13\n",
      "17:17:13 INFO - Number of files is 1, source profile {'max_file_size': 0.0023164749145507812, 'min_file_size': 0.0023164749145507812, 'total_file_size': 0.0023164749145507812}\n",
      "17:17:13 INFO - Loading model from flair/ner-english-large\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2025-01-16 17:17:23,474 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "17:17:24 INFO - Completed 1 files (100.0%) in 0.005 min\n",
      "17:17:24 INFO - Done processing 1 files, waiting for flush() completion.\n",
      "17:17:24 INFO - done flushing in 0.0 sec\n",
      "17:17:24 INFO - Completed execution in 0.177 min, execution result 0\n"
     ]
    }
   ],
   "source": [
    "%%capture\n",
    "PIIRedactor(input_folder='ray/test-data/input',\n",
    "            output_folder= 'output',\n",
    "            pii_redactor_entities = [\"PERSON\", \"EMAIL_ADDRESS\"],\n",
    "            pii_redactor_operator = \"replace\",\n",
    "            pii_redactor_transformed_contents = \"title\").transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
   "metadata": {},
   "source": [
    "##### **** The specified folder will include the transformed parquet files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "7276fe84-6512-4605-ab65-747351e13a7c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['output/metadata.json', 'output/test1.parquet']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import glob\n",
    "glob.glob(\"output/*\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
